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Abstract 

The structure of a Bayesian network encodes most of the information about the prob- 
ability distribution of the data, which is uniquely identified given some general distri- 
butional assumptions. Therefore it's important to study the variability of its network 
structure, which can be used to compare the performance of different learning algorithms 
and to measure the strength of any arbitrary subset of arcs. 

In this paper we will introduce some descriptive statistics and the corresponding para- 
metric and Monte Carlo tests on the undirected graph underlying the structure of a 
Bayesian network, modeled as a multivariate Bernoulli random variable. 

Keywords: Bayesian network, bootstrap, multivariate Bernoulli distribution, structure learn- 
ing algorithms. 



1. Introduction 

In recent years Bayesian networks have been successfully applied in several different disci- 
plines, including medicine, biology and epidemiology (see for example Friedman et al. (2000) 
and Holmes and Jain (2008)). This has been made possible by a rapid evolution of structure 
learning algorithms, both constraint-based (from PC (Spirtes et al. 2001) to Grow-Shrink 
(Margaritis 2003) to IAMB (Tsamardinos et al. 2003) and its variants (Yaramakala and Mar- 
garitis 2005)) and score-based (from Greedy Equivalent Search (Chickering 2002) to genetic 
algorithms (Larranaga et al. 1997)). 

The main goal in the development of these algorithms was the reduction of the number of 
either independence tests or score comparisons needed to learn the structure of the Bayesian 
network. Their correctness was proved assuming either very large sample sizes in relation 
to the number of variables (in the case of Greed Equivalent Search) or the absence of both 
false positives and false negatives (in the case of Grow-Shrink and IAMB). In most cases the 
characteristics of the learned networks were studied using a small number of reference data 
sets (Elidan 2001) as benchmarks, and differences from the true structure measured with 
descriptive measures such as Hamming distance (Jungnickel 2008). 

This approach to model evaluation is not possible for real world data sets, as the true structure 
of their probability distribution is not known in advance. An alternative is provided by the use 
of either parametric or nonparametric bootstrap (Efron and Tibshirani 1993). By applying 
the learning algorithm to a sufficiently large number of bootstrap samples it is possible to 
obtain confidence intervals and empirical probabilities for any feature of the network structure 
(Friedman et al. 1999), such as the presence or the composition of the Markov Blanket of 
a particular node. The fundamental limit in the interpretation of the results is that the 
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"reasonable" level of confidence for thresholding depends on the data. 

In this paper we propose a modified bootstrap-based approach for the inference on the struc- 
ture of a Bayesian network. The undirected graph underlying the network structure is modeled 
as a multivariate Bernoulli random variable in which each component is associated with an 
arc. This assumption allows the derivation of both exact and asymptotic measures of the 
variability of the network structure or its parts. 

2. Bayesian networks and bootstrap 

Bayesian networks are graphical models where nodes represent random variables (the two 
terms are used interchangeably in this article) and arcs represent probabilistic dependencies 
between them (Korb and Nicholson 2004). 

The graphical structure Q = (V, A) of a Bayesian network is a directed acyclic graph (DAG) 
which defines a factorization of the joint probability distribution of V = {Xi,X2, ■ ■ ■ 
often called the global probability distribution, into a set of local probability distributions, one 
for each variable. The form of the factorization is given by the Markov property, which states 
that every random variable Xi directly depends only on its parents Hxi- 

V 

P{Xi, ...,Xy) = Y{ P{Xi I Ux,) (for discrete variables) (1) 

i=l 

V 

f{Xi, . . . , Xy) = JJ f{Xi I IIx,) (for continuous variables). (2) 

1=1 

Therefore it's important to define confidence measures for specific features in the network 
structure, such as the presence of specific configurations of arcs. A related problem is the 
definition of a measure of variability for the network structure as a whole, both as an indi- 
cator of goodness of fit for a particular Bayesian network and as a criterion to evaluate the 
performance of a learning algorithm. 

A possible solution for both these problems has been developed by Friedman et al. (1999) 
using bootstrap simulation, and modified by Imoto et al. (2002) to estimate the confidence in 
the presence of an arc (called edge intensity, and also known as arc strength) and its direction. 
This approach can be summarized as follows: 

1. For b = 1,2, ... ,m 

(a) re-sample a new data set from the original data D using either parametric or 
nonparametric bootstrap. 

(b) learn a Bayesian network Qi, from D^. 

2. Estimate the confidence in each feature / of interest as 

^ m 

P(/) = -E/(^^)- (3) 

6=1 

However, the empirical probabilities P(/) are difficult to evaluate, because the distribution of 
Q is unknown and the confidence threshold value depends on the data. 
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3. The multivariate Bernoulli distribution 



Let Bi,B2, - . . ,Bk, G N be Bernoulli random variables with marginal probability of suc- 
cess pi,p2, ■ . . ,Pk, that is Bi ~ Ber{pi), i = 1, . . . , /c. Then the distribution of the random 
vector B = [Bi, B2, . . . , B^]'^ over the joint probability space of B2, . . . , B^ is a multivari- 
ate Bernoulli random variable (Krummenauer 1998b), denoted as Ber^ip). Its probability 
function is uniquely identified by the parameter collection 



which represents the dependence structure among the marginal distributions in terms of si- 
multaneous successes for every non-empty subset / of elements of the random vector. 

However, several useful results depend only on the first and second order moments of B 



which is in fact used as an approximation of p in the generation random multivariate Bernoulli 
vectors in Krummenauer (1998a). 

3.1. Uncorrelation and independence 

Let's first consider a simple result that links covariance (and therefore correlation) and inde- 
pendence of two univariate Bernoulli variables. 

Theorem 1. Let Bi and Bj be two Bernoulli random variables. Then Bi and Bj are inde- 
pendent if and only if their covariance is zero: 



P = {w:/^{l,...,fe},/^0}, 



(4) 



E(5i) = Pi 
VAR(i?,) = E(i?2)_E(i?,)2=p,-p2 

COy{Bi,Bj) = E{B,Bj) - E{Bi)E{Bj) = pij - pipj 



(5) 
(6) 
(7) 



and the reduced parameter collection 



P = {Pij : j = 1, • • • 



(8) 




(9) 



Proof. If Bi and Bj are independent then by definition 



COV(S„ Bj) = Pij - pipj = P(S, = 1, Bj = 1) - P(S, = l)P{Bj = 1) = 0, 



as P{Bi = l,Bj = 1) = P{Bi = l)P(5j = 1). 

If on the other hand we have that COV(i?j, Bj) = 0, then 



Pij - PiPj = 0^pij = piPj =^ Bi AL Bj 



which completes the proof. 



□ 



This theorem can be extended to multivariate Bernoulli random variables as follows. 
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Theorem 2. Let B = [Bi, B2,..., Bk]'^ and C = [Ci, C2, . . . , Ci]'^ , k,l be two multivari- 
ate Bernoulli random variables. Then B and C are independent if and only if 

BALC^ COV(B, C) = O (10) 

where O is the zero matrix. 

Proof. If B is independent from C, then by definition every pair {Bi,Cj), i = l,...,k, 
j = 1, ... ,1 is independent. Therefore the covariance matrix of B and C is 

coy {Bi, Cj) = dj = ^ COV(B, C) = [cij] = o 

If conversely the covariance matrix COV(B, C) is equal to the zero matrix, every pair {Bi, Cj) 
is independent as 

Cij = Pij - PiPj = pij = piPj 
This imphes the independence of the random vectors B and C, as their sigma- algebras 

a{B) = a{Bi) X . . . X a{Bk) and a{C) = a{Ci) x . . . x a{Ci) 

are functions of the sigma algebras induced by the two sets of independent random variables 
Bi, B2, . . . , Bk and Ci, C2, . . . , Q. □ 

The correspondence between uncorrelation and independence is identical to the analogous 
property of the multivariate Gaussian distribution (Ash 2000), and is closely related to the 
strong normality defined for orthogonal second order random variables in Loeve (1977). It can 
also be applied to disjoint subsets of components of a single multivariate Bernoulli variable, 
as they are also distributed as multivariate Bernoulli random variables. 

Theorem 3. Let B = [Bi, B2, . . . , Bk]"^ be a multivariate Bernoulli random variable; then 
every random vector B* = [Bi-^_, Bi^, . . . , B^]"^ , {ii,i2, ... ,ii} {1,2, ... ,k} is a multivariate 
Bernoulli random variable. 



Proof. The marginal components of B* are Bernoulli random variables, because B is multi- 
variate Bernoulli. The new dependency structure is defined as 

P* = {Pl* : r C {ii, . . . , C {1, . . . , k}, 1*^0}, 

and uniquely identifies the probability distribution of B*. □ 

Example 1. Let's consider the trivariate Bernoulli random variable 



B 



Bi 
B2 
B3 



Bi + B2 



where 



Bi 





B2 




and 



B. 



Bi 


B3 
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Then the covariance matrix 



C0V(Bi,B2) = E 





B2 




[Bi ^3 





B2 




■ ■ 

B1B2 B2B3 




P2 





E([i?i B3]) 
[pi Pa] 



■ ■ 

P12 P23 




■ ■ 

P1P2 P2P3 




0' 

P12 - P1P2 P23 - P2P3 




of the two components Bi and B2 is equal to the zero matrix if and only if 



P12 = P1P2 



[ P23 = P2P3 

which in turn implies and is implied by Bi _LL B2. 



{Bi AL B2, B2 AL B3} 



3.2. Properties of the covariance matrix 

The covariance matrix S = [o'ij], i,j = l,...,k associated with a multivariate Bernouhi 
random vector has several interesting numerical properties. Due to the form of the central 
second order moments defined in formulas 6 and 7, the diagonal elements are bound in the 
interval 



<7i 



Pi - Pi e 



0, 



1 



(11) 



The maximum is attained for pi = ^, and the minimum for both pi = and pi = 1. For the 
Cauchy-Schwartz theorem (Ash 2000) then 
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Wii\ e 



:i2) 



The eigenvalues Ai, A2, . . . , A^ of S are similarly bounded, as shown in the following theorem. 

Theorem 4. Let B = [i3i,i?2) • • • j-Bfc]"^ he a multivariate Bernoulli random variable, and let 
S = [c^ij], i,j = 1, . . . ,k be its covariance matrix. Let Xi, i = 1, . . . ,k be the eigenvalues of 
S . Then 

k r 

O^J^A,^- (13) 

i=l 



and 



^ A,: ^ 



(14) 
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Proof. Since S is a real, symmetric, non-negative definite matrix, the eigenvalues Aj are non- 
negative real numbers (Salce 1993); this proves the lower bound in both inequalities. 

The upper bound in the first inequality holds because 

Aj = o-jj ^ max o-jj = V] maxo-jj = -, 
i=l 1=1 1=1 1=1 

as the sum of the eigenvalues is equal to the trace of S (Seber 2008). This in turn implies 

k 



Ai ^ ^ Ai ^ 



i=l 



which completes the proof. 

These bounds define a convex set in M'^, defined by the family 



V 



0, 



k' 



□ 



(15) 



where A (c) is the non-standard k — 1 simplex 



A^-\c) = ^Ai, . . . , A,.) e : ^ A, = c. A, ^ 



(16) 



3.3. Sequences of multivariate Bernoulli variables 

Let's now consider a sequence of independent and identically distributed multivariate Bernoulli 
variables Bi, B2, . . . , Bm ~ Berk{p). The sum 

m 

Sm = ^ Bj ~ Bik{m, p) (17) 
1=1 

is distributed as a multivariate Binomial random variable (Krummenauer 1998b), thus pre- 
serving one of the fundamental properties of the univariate Bernoulli distribution. A similar 
result holds for the law of small numbers, whose multivariate version states that a /c-variate 
Binomial distribution Bik{m,p) converges to a multivariate Poisson distribution Pk{A.): 

Sm -4 Pfc(A) as mp — )• A. (18) 

Both these distributions' probability functions, while tractable, are not very useful as a basis 
for explicit inference procedures. An alternative is given by the asymptotic multivariate 
Gaussian distribution defined by the multivariate central limit theorem (Ash 2000): 

S™-mE(Bi)^^^(Q^^^_ (19) 



m 



The limiting distribution is guaranteed to exist for all possible values of p, as the first two 
moments are bounded and therefore are always finite. 
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4. Inference on the network structure 

Let U = (VjE) be the undirected graph underlying the DAG G = (V,A), defined as its 
unique biorientation (Bang- Jensen and Gutin 2009). Each edge e £ E of U corresponds to 
the directed arcs in A with the same incident nodes, and has only two possible states (it's 
either present in or absent from the graph). 

Then a, i = 1, . . . , |V|(|V| — l)/2 is naturally distributed as a Bernoulli random variable 

^ _ ieie E with probability pt ^^^^ 
] Ci ^ E with probability 1 — pi 

and every set C V x V (including E) is distributed as a multivariate Bernoulli random 
variable W and identified by the parameter collection 

Pw = {Pw--wCW,w^0}. (21) 

The elements of pw can be estimated via parametric or nonparametric bootstrap as in Fried- 
man et al. (1999), because they are functions of the DAGs ^b, h = l,...,m through the 
underlying undirected graphs Uh = {V,Eb). The resulting empirical probabilities 



m 

1 

Pu 



lSlhu^QE,}{Ub), (22) 
b=i 

in particular 

^ m ^ m 

= -Hh^r^E,}{l^b) and Pij = -^\e,(iE,,e,&E,}{Ub), (23) 

b=l b=l 

can be used to obtain several descriptive measures and test statistics for the variability of the 
network's structure. 



4.1. Interpretation of bootstrapped networks 

Considering the undirected graphs Ui, . . . , U„i instead of the corresponding directed graphs 
Qi, . . . , Qm greatly simplifies the interpretation of bootstrap's results. In particular the vari- 
ability of the graphical structure can be summarized in three cases according to the entropy 
(Cover and Thomas 2006) of the set of the bootstrapped networks: 

• minimum entropy: all the networks learned from the bootstrap samples have the same 
structure, that is 

Ei = E2 = ... = Em = E. (24) 

This is the best possible outcome of the simulation, because there is no variability in 
the estimated network. In this case the first two moments of the multivariate Bernoulli 
distribution are equal to 

f 1 ifaeE 

Pi = < and E = O. (25) 

otherwise 
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intermediate entropy: several network structures are observed with different frequencies 
nih, ^ mb = rn. The first two sample moments of the multivariate Bernoulli distribution 
are equal to 

m = - Yl p., = l Yl (26) 

maximum entropy: all 21^1(1^1^^-'/^ possible network structures appear with the same 
frequency, that is 

This is the worst possible outcome because edges vary independently of each other and 
each one is present in only half of the networks (proof provided in appendix B): 

Pi = \ and S = ^4. (28) 



4.2. Descriptive statistics of network's variability 

Several functions have been proposed in literature as univariate measures of spread of a 
multivariate distribution, usually under the assumption of multivariate normality (see for 
example Muirhead (1982) and Bilodeau and Brenner (1999)). Three of them in particular 
can be used as descriptive statistics for the multivariate Bernoulli distribution: 

• the generalized variance 

VARg(S) = det(S). (29) 

• the total variance (called total variation in Mardia et al. (1979)) 

VARt(S) =tr(S). (30) 

• the squared Frobenius matrix norm 

VAR;v(S) = |||S-^4||||. (31) 

Both the generalized variance and the total variance associate high values of the statistic to 
unstable network structures, and are bounded due to the properties of the covariance matrix 
S. For the total variance it's easy to show that 

k ^ 

^ VARt(S) = ^ cTii ^ -k (32) 

i=l 

due to the bounds on the variances an in equation 11. The generalized variance is similarly 
bounded due to Hadamard's theorem on the determinant of a non-negative definite matrix 
(Seber 2008): 

0<VARG(S)^f];a,, ^ . (33) 
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They reach the respective maxima in the maximum entropy case and are equal to zero only 
in the minimum entropy case. The generalized variance is also strictly convex (the maximum 
is reached only for S = \lk), but it is equal to zero if S is rank deficient. For this reason it's 
convenient to reduce S to a smaller, full rank matrix (let's say S*) and compute VARg'(S*) 
instead of VARg(S). 

The squared Frobenius norm on the other hand associates high values of the statistic to stable 
network structures. It can be rewritten in terms of the eigenvalues Ai, . . . , of S as 



VAR^(S) = (a, - 



(34) 



It has a unique maximum (in the minimum entropy case), which can be computed as the 
solution of the constrained minimization problem in A = [Ai, . . . , A^]-^ 

nun /(A) = - '^i^i - subject to Aj ^ 0, ^ Aj ^ - (35) 

1=1 ^ ^ 1=1 

using the extended Lagrange multipliers methods (Nocedal and Wright 1999). It also has a 
single minimum in A* = [|, • . . , j], which is the projection of [|, • • • , ^] onto the set D and 
coincides with the maximum entropy case. The proof for these boundaries and the rationale 
behind the use of instead of \lk are reported in appendix A. 

The corresponding normalized statistics are: 

VARt(S) 4VARt(S) 



VARt(S) 



maxEVART(S) k 



VARg(S) = ^^^^^if]^, = 4'=VARg(S) 
maxs VARg(2j) 

maxs VARAr(E) - VAR7v(S) _ k^ - 16VARAr(S) 

maxEVARiv(S) -minsVARAr(i;) ~ k{2k - 1) ' 



VARAr(S 



All of them vary in the [0, 1] interval and associate high values of the statistic to networks 
whose structure display a high variability across the bootstrap samples. Equivalently we can 
define 



VARt(S) = 1 - VARt(S) 



VARg(S) = 1 - VARg(S) 



VAR^(S) = 1 - VAR7v(S) 

which associate high values of the statistic to networks with little variability, and can be used 
as measures of distance from the maximum entropy case. 

Example 2. Let's consider three multivariate Bernoulli distributions Wi, W2, W3 with 
second order moments 



1 

25 



6 1 
1 6 



1 

625 



66 -21 
-21 126 



and S3 = 

625 



66 91 
91 126 
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Figure 1: The covariance matrices Si, S2 and S3 represented as functions of their eigenvalues 
in D (green). The points (0,0) and (j, j) correspond to the minimum entropy and maximum 
entropy cases. 



The eigenvalues o/Si, S2 and S3 are 

0.3069" 
0.0003 

and the values of the generalized variance, total variance and squared Frohenius matrix norm 
(both normalized and in the original scale) for the three covariance matrices are reported 
below. 





VARt(S) 


VARg(S) 


VARjv(S) 


VARr(S) 


VARg(S) 


VARjv(S) 


Si 


0.48 


0.056 


0.1384 


0.96 


0.896 


0.9642 


S2 


0.3072 


0.02016 


0.2468 


0.6144 


0.32256 


0.6752 


S3 


0.3072 


8.96 X 10-5 


0.2869 


0.6144 


0.00143 


0.5682 



0.28 
0.20 



0.2121 
0.095 



4.3. Asymptotic inference 

The hmiting distribution of the descriptive statistics defined above can be derived by replacing 
the covariance matrix S with its unbiased estimator S and by considering the multivariate 
Gaussian distribution from equation 19. The hypothesis we are interested in is 

H^:T. = hk ^1 : S / ^4, (36) 

which relates the sample covariance matrix with the one from the maximum entropy case. 
For the total variance we have that (Muirhead 1982) 

tT = 4mtr(S)~xL, (37) 

and since the maximum value of tr(S) is achieved in the maximum entropy case, the hypothesis 
in 36 assumes the form 

Ho : tr(S) = ^ Hi : tr(S) < ^. (38) 
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Then the observed significance value is 

ar = P{tT ^ tD, 
and can be improved with the finite sample correction 

P{tT ^ t^'') 



aT = P {tr ^ tT ' I *T e [0, mk]) 



P{tT ^ mk) 



(39) 



(40) 



which accounts for the bounds on VAR3^(S) from inequality 32. 

For the generalized variance there are several possible asymptotic and approximate distribu- 



tions: 



the Gaussian distribution defined in Anderson (2003) 

^ det(S) \ ^ 
detilh) i " 



N{0,2k). 



(41) 



• the Gamma distribution defined in Steyn (1978) 



*G2 



mk f. det(S) 



2 Vdet(|4) 



Ga 



k{m + \ — k) 



(42) 



• the saddlepoint approximation defined in Butler et al. (1992). 
As before the hypothesis in 36 assumes the form 



Hq : det(S) = det ( ^4 



Hi : det(S) < det ( -4 



The observed significance values for the Gaussian and Gamma distributions are 

«G, = P{tG, ^ t^T) = P{tG, ^ f^dD 

and the respective finite sample corrections for the bounds on det(S) are 

Pita, ^ f^^^^) - PitG, ^ 



ac, = P {to, ^ icT I G [-\/^,0]) 

mk 
0,— 



m] 



P{tG, ^ 0) - P(tGi ^ -V^) 



(43) 
(44) 

(45) 
(46) 



The test statistic associated with the squared Frobenius norm is the test for the equality of 
two covariance matrices defined in Nagao (1973), 



m 

tN = ^ti 



-1 



Ik 



m 



tr 



4S-4 



(47) 
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Figure 2: Asymptotic significance values of tx (green), (blue) and (violet) for Si, T12 
and S3 from table 1. 



because 



tr 



4S-Ifc 



tr 



16tr U 



16 tr 
1 



UAU" - ^4 
4 



16 tr 



1 



UAU^ - -Jk 
4 



A - -^h 



U^j = 16 tr 

1=1 ^ ^ 



A- -4 



1 

4" 



16IIIS- -4|||2, (48) 



where U AU^ is the spectral decomposition of S (see appendix A for and explanation of the 
use of 1 4 instead of |4)- The significance value for tjv is 



oss\ 
N ) 



(49) 



as the hypothesis in 36 becomes 



1 



i/o:|||S--4|||F = 



Hi : Ills - -4|||f > 0. 



(50) 
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tT(S) 




10 


20 


50 


100 


200 




Si 


0.4911379 


0.4576109 


0.4054044 


0.3549436 


0.2912432 




0.906041 


0.863836 


0.7814146 


0.691495 


0.571734 






0.0941934 


0.0263308 


0.0008529 


0.0000038 


1.09 X 10- 


10 


S2 


0.1737661 


0.04970497 


0.001644116 


0.0000075 


2.14 X 10 


10 


S3 


0.0941934 


0.0263308 


0.0008529 


0.0000038 


1.09 X 10- 


10 


0.1737661 


0.04970497 


0.001644116 


0.0000075 


2.14 X 10 


10 




Si 


0.6039442 


0.5242587 


0.4231830 


0.3411315 


0.250054 




0.9052188 


0.8475223 


0.7357998 


0.6166961 


0.4651292 




0.1214881 


0.0235145 


0.0002789 


0.0000002 


2.79 X 10- 


13 


S2 


0.1820918 


0.03801388 


0.000484961 


0.00000045 


5 X 10 1^ 




S3 


3.13 X 10"^° 


2.03 X 10^20 


9.82 X 10"^^ 


4.42 X 10^1°^ 


1.26 X 10" 


201 


4.7 X 10-1° 


3.28 X 10 20 


1.7 X 10^° 


7.99 X 10-i°i 


2.35 X 10 


-201 


tjv(S) 




0.9652055 


0.9091238 


0.7149371 


0.4368392 


0.1422717 




Si 


0.9645473 


0.9091083 


0.7149371 


0.4368392 


0.1422717 


S2 


0.5649382 


0.2537627 


0.0170906 


0.0001428 


7.48 X 10- 


9 


0.556708 


0.2536360 


0.01709067 


0.0001428399 


7.48 X 10 


9 


S3 


0.1545514 


0.0147960 


0.0000085 


2.37 X 10-11 


1.34 X 10- 


22 


0.1385578 


0.01462880 


8.5 X lO""** 


2.37 X 10"" 


1.34 X 10 


-22 



Table 1: Asymptotic significance values of tx, and In for Si, S2 and S3; the ones 
computed with the finite sample corrections are reported in bold. 



Unlike the previous statistics, Nagao's test displays a very good convergence speed, to the 
point that the finite sample correction for the bounds on the squared Frobenius matrix norm 

UN -P [tN ^ I ta, G [0, J) - p^^^ ^ ^rnax-^ (51) 

is not appreciably better than the raw significance value. 

Example 3. Let's consider again the multivariate Bernoulli distributions Wi, W2, W3 and 
their covariance matrices Si, S2, S3 from example 2. The respective asymptotic significance 
values for the statistics tx, tci cind t^ are reported in table 1. 



4.4. Monte Carlo inference and parametric bootstrap 

Another approach to compute the significance values of the statistics VARx'(S), VARg(S) and 
VAR7v(S) is again parametric bootstrap. 

The multivariate Bernoulli distribution Wq specified by the hypothesis in 36 has a diagonal 
covariance matrix, so its components Wq^, i = 1, . . . , A; are uncorrelated. According to theorem 
1 they are also independent, so the joint distribution of Wq is completely specified by the 
marginal distributions Wq^ ~ Ber{^). Therefore it's possible (and indeed quite easy) to 
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Figure 3: Monte Carlo significance values for the total variance (green), the generalized 
variance (blue) and the squared Frobenius matrix norm (violet) from table 2. 

generate observations from the null distribution and use them to estimate the significance 
value of the normalized statistics VARj'(S), VARg'(S) and VAR7v(S) defined in section 4.2: 

1. compute the value of test statistic T on the original covariance matrix S. 

2. For r = 1,2, . . . 

(a) generate m sets of k random samples from a Ber{^) distribution. 

(b) compute their covariance matrix S*. 

(c) compute T* from S* 

3. compute the Monte Carlo significance value as 

1 ^ 

"« = ]^E^{->n(^r)- (52) 

r=l 

This approach has two important advantages over the parametric tests defined in section 4.3: 

• the test statistic is evaluated against the null distribution instead of its asymptotic 
approximation, thus removing any distortion caused by lack of convergence (which can 
be quite slow and problematic in high dimensions). 

• each simulation r has a lower computational cost than the equivalent application of the 
structure learning algorithm to a bootstrap sample b. Therefore the Monte Carlo test 
can achieve a good precision with a smaller number of bootstrapped networks, allowing 
its application to larger problems. 
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VARt($]) 




10 


20 50 


100 


200 


Si 


0.569655 


0.457109 0.129242 


0.017416 


0.000334 




0.016834 


0.000205 








S3 


0.016834 


0.000205 








VARg(S) 


Si 


0.784102 


0.512839 0.14788 


0.013678 


0.000094 


S2 


0.063548 


0.000761 








S3 


0.005909 


0.000008 








VARjv(S) 


Si 


0.743797 


0.568819 0.239397 


0.096544 


0.019633 


S2 


0.196996 


0.037772 0.001018 


0.000005 





S3 


0.018292 


0.000355 









Table 2: Bootstrap significance values from parametric bootstrap for Si, S2 and S3. 

Example 4. Let's consider the multivariate Bernoulli distributions Wi, W2, W3 from ex- 
amples 2 and 3 one last time. The corresponding significance values for the three normalized 
statistics VAR7^(S), VARg'(S) and VAR7v(S) are reported in table 2 for various sizes of the 
bootstrap samples (m = 10,20,50,100,200). Each one have been computed from R = 10^ 
covariance matrices generated from the null distribution. The code used for the simulation is 
reported in appendix C. 



5. Conclusions 

In this paper we derived the properties of several measures of variability for the structure 
of a Bayesian network through its underlying undirected graph, which is assumed to have a 
multivariate Bernoulli distribution. Descriptive statistics, asymptotic and Monte Carlo tests 
were developed along with their fundamental properties. They can be used to compare the 
performance of different learning algorithms and to measure the strength of any arbitrary 
subset of arcs. 
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Appendix 



A. Bounds on the squared Frobenius matrix norm 

The squared Frobenius matrix norm of the difference between the covariance matrix S and 
the maximum entropy matrix is 

m-\ml = y:[>^^-i] ■ (53) 

Its unique global minimum is 



i=i ^ ^ 



\\\^-\h\\\\ = ^ (54) 

for S = due to the fundamental properties of the matrix norms (Salce 1993). However, 
it has a varying number of global maxima depending on the dimension of S. They are the 
solutions of the constrained minimization problem 



inin /(A) = — ( "^^ ~ 4 ) subject to Aj ^ 0, Aj ^ - (55) 

i=\ ^ ' 1=1 



and can be computed from the Lagrangian equation and its derivatives 



^ ( l\^*^ ( k ^ 

i=l ^ ^ j=l \ 1=1 



(56) 



S 1 

--CiX,s) = -2Xi + -- Si + Sk+1 (57) 
OAi 2 

where s = [si. . . . , s^+i]^ are the Lagrangian multipliers. This configuration of stationary 
points does not influence the results based on the asymptotic distribution of the multivari- 
ate Bernoulli distribution, but prevents any direct interpretation of quantities based on this 
difference in matrix norm as descriptive statistics. 

On the other hand the difference in squared Frobenius norm 



VAR^r(S) = |||S-^/fc|||| = ^(A.-^)' 

i=i ^ ^ 

has both a unique global minimum (because it's a convex function) 



(59) 



mmVAR^(S) = VAR^ ( j = 2^ ( | " ^) = ^^^^^ (60) 



and a unique global maximum 



maxVARw(S) = VAR^(0) = ^('^') =^ (61) 



Marco Scutari 



17 




Figure 4: Squared Frobenius matrix norms from ^Ik (on the left) and jlk (on the right) in 
T> for k = 2. The green area is the set P of the possible eigenvalues of S and the red lines are 
level curves. 

which correspond to the minimum entropy (A = [0, . . . , 0]) and the maximum entropy (A = 
[i? ■ • • ) i]) covariance matrices respectively (see figure 4). However since is not a valid 
covariance matrix for a multivariate Bernoulli distribution, \/AR]\f(T,) cannot be used to derive 
any probabilistic result. 

B. Multivariate Bernoulli and the maximum entropy case 

Let's first state a simple theorem on the probability of one and two edges in the maximum 
entropy case. 

Theorem 5. Let Ui, . . . ,lAn, n = 2*", m = |V|(|V| — l)/2 he all possible undirected graphs 
with vertex set V and let 

P{Uk) = - fe = l,...,n. (62) 
n 

Let Ci and Cj, i ^ j be two edges in V x V. Then 

P(ej) = ^ and P{ei,ej) = ^. (63) 

Proof. The number of possible configurations of an undirected graph is given by the Cartesian 
product of the possible states of its m edges, resulting in 

|{0,l}x...x{0,l}| = |{0,in=2- (64) 

possible undirected graphs. Then edge Cj is present in 

|{0, 1} X . . . X 1 X . . . X {0, 1}| = |l X {0, = 2™-i (65) 
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graphs and Cj and ej are simultaneously present in 

|{0, 1} X . . . X 1 X 1 X . . . X {0, 1}| = |l2 X {0, l}'"-^! = 2"^-2 (66) 
graphs. Therefore 



□ 



Then the values of pi and S = [cjjj] in equation 28 are indeed: 



E(e.)=P. = ^ (68) 

VAR(ei) = au=p^-pj = l-\ = \ (69) 

COV(ei, Cj) = aij = pij - pipj = ^ - ^ • ^ = 0. (70) 

The fact that aij = for every i ^ j also proves that the edges are independent according to 
theorem 1. 



C. R code for the parametric bootstrap simulation 

The following R function has been used to compute the significance values in example 4. 

biv.ber.sim = f unctionCsigma, B, R, test) { 

if (test == "vart") 

FUN = functiondambda) 1/2 - sum(lambda) 
else if (test == "varg") 

FUM = functiondambda) 1/16 - prod(lambda) 
else if (test == "varn") 

FUN = functiondambda) sum( (lambda - 1/4) "2) 

Sim = matrix (OL, nrow = B, ncol = 2) 
tstar = numeric (R) 

sO = eigen(sigma)$values 
to = FUN(sO) 

for (i in 1:R) { 

for (j in 1:B) 

sim[j, ] = rbinom(2, 1, 1/2) 

p = prop.table(table(f actor(sim[, 1], levels = c(0,l)), 
factor(sim[, 2], levels = c(0,l)))) 

sigmastar = matrix ( 

c(sum(p[2,]) * (1 - sum(p[2,])), 

p[2,2] - sum(p[,2])*sum(p[2,]) , 

p[2,2] - sum(p[,2])*sum(p[2,]) , 

sum(p[,2]) * (1 - sum(p[,2]))), 
nrow = 2, ncol = 2, byrow = TRUE) 



sstar = eigen(sigmastar) $values 
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tstar[i] = FUN(sstar) 

> 

return(length(tstar [tstar >= tO])/R) 

} 

The three covariance matrices Si, S2 and S3 have been created in R with the foUowing 
commands. 



sigmal = inatrix(c(6, 1, 1, 6)/25, ncol = 2) 

sigma2 = matrix (c (66, -21, -21, 126)/625, ncol = 2) 

sigma3 = matrix (c (66, 91, 91, 126)/625, ncol = 2) 

All the simulations have been performed on a Core Duo 2 machine with 1GB of RAM, with 
R 2.9.0 (R Development Core Team 2009) and an updated Debian GNU/Linux distribution. 

D. R code for the asymptotic inference 

total . variance = f unction(sigma, b, adjusted = FALSE) { 

res = pchisq(4 * b * sum(diag(sigma) ) , 2 * b, lower. tail = TRUE) 
if (adjusted) 

res = res / pchisq(2 * b, 2 * b, lower. tail = TRUE) 
return(res) 

} 

generalized. variance = function(sigma, b, adjusted = FALSE) { 

res = pgaiimia(4 * b * sqrt (det (sigma) ) , b - 1, 1, lower, tail = TRUE) 

if (adjusted) 

res = res / pgaiiima(b, b - 1, 1, lower. tail = TRUE) 

return(res) 

} 

f robenius .norm = f unction(sigma, b, adjusted = FALSE) { 

res = pchisq(8 * b * sum( (eigen(sigma) lvalues - 1/4 )"2), 3, lower. tail = FALSE) 
if (adjusted) 

res = (res - pchisq(b, 3, lower. tail = FALSE)) / pchisq(b, 3, lower. tail = TRUE) 
return(res) 

} 
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